The general approach to understanding the relationships in the dataset, particularly with respect to the quality of wine, is the following:
1. Look at the basic structure of the dataset in terms of number of variables, data types, N.A. values and so on.
2. Print basic statistics and histograms of the variables in the dataset to understand their distributions.
3. Conduct bivariate analyses to understand the relationships between the different variables in the dataset.
4. Based on the preliminary findings of the bivariate analyses, conduct multivariate analyses to further understand relationships in the dataset, with the aim of uncovering insights into the factors behind the quality of wine.
So, the first step is to load the data and check the structure of the dataset.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Right, some simple data manipulation is needed (i.e. remove X and convert quality to a factor).
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
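A minimal sketch of this kind of manipulation in base R could look as follows. The data frame is a tiny made-up stand-in for the real dataset (the values are illustrative only), and the name `wine` is an assumption rather than the name used in the original code.

```r
# Toy stand-in for the wine data; values are illustrative only.
wine <- data.frame(
  X       = 1:6,
  alcohol = c(9.4, 9.8, 9.8, 10.0, 11.2, 12.5),
  quality = c(5L, 5L, 6L, 7L, 3L, 8L)
)

# Drop the index column, which carries no information.
wine$X <- NULL

# Convert quality to a factor whose levels follow the numeric order.
wine$quality <- factor(wine$quality, levels = sort(unique(wine$quality)))

str(wine)
```

Passing the levels explicitly keeps the grades in numeric order, which matters later when the factor is used on a plot axis.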
Next step… let’s look at some basic statistics.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol quality
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 53
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 5:681
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 6:638
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 7:199
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 8: 18
Ok, so right away we notice that a few variables have some values that can be seen as outliers. For example, residual.sugar has a 3rd quartile value of 2.6 while its max value is 15.5, which is almost 6 times the 3rd quartile.
We find a similar pattern with chlorides and total.sulfur.dioxide. We will be able to view this graphically with histograms, which is the next step.
There are far more wines with a quality level of 5 or 6 than with any other rating. Also, there are no wines rated below 3 and none above 8. Could this affect the quality of the derived statistics or the robustness of any model applied to this data?
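To put a number on this imbalance, the counts per quality level can be turned into shares. The sketch below simply re-enters the counts printed in the summary output; the variable names are assumptions.

```r
# Counts per quality level, copied from the summary output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each level, as a percentage of all wines.
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
print(quality_share)

# Combined share of the two dominant levels (5 and 6).
mid_share <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
print(mid_share)
```

Levels 5 and 6 together account for roughly 82% of the 1599 wines, while levels 3 and 8 each hold fewer than 20 observations.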
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The variable sulphates seems to be right skewed, with a median value of 0.62. Let’s transform it using a log10 scale.
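As a rough illustration of why a log10 scale helps with right-skewed variables, the sketch below compares the sample skewness before and after the transformation. It uses simulated log-normal values, not the real sulphates column, and `skewness` is a hypothetical helper written for this example.

```r
set.seed(1)
# Simulated right-skewed values; NOT the real sulphates column.
x <- rlnorm(500, meanlog = -0.5, sdlog = 0.5)

# Simple moment-based sample skewness (hypothetical helper).
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
cat("skewness before:", skew_raw, " after log10:", skew_log, "\n")
```

A call such as `hist(log10(x))` would then draw the transformed distribution, which should look much closer to symmetric than the raw one.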
## max value is 10.99347 s.d. away
Residual sugar also seems to be right skewed: its max value is almost 11 times its standard deviation. Again, let’s plot it using a log-transformation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH seems to be more normally distributed.
## max value is 13.98185 s.d. away
Alcohol is also right skewed. Its max value is almost 14 times its standard deviation. Let’s plot it using a log-transformation.
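A note on reading the "s.d. away" figures: the summary statistics suggest they are the maximum divided by the standard deviation rather than the distance of the maximum from the mean. Under that assumption (which is mine, not stated in the output), the implied standard deviation and the usual z-score of the maximum can be worked out from the numbers already printed for alcohol:

```r
# Figures copied from the output above (alcohol).
alc_mean    <- 10.42
alc_max     <- 14.90
max_over_sd <- 13.98185   # printed figure, ASSUMED here to be max / sd

# Standard deviation implied by that assumption.
alc_sd <- alc_max / max_over_sd

# Distance of the maximum from the mean, in standard deviations.
z_max <- (alc_max - alc_mean) / alc_sd
cat("implied sd:", alc_sd, " z-score of max:", z_max, "\n")
```

If the assumption holds, the most alcoholic wine sits a little over 4 standard deviations above the mean, which still points to a long right tail.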
## [1] 215 59 44 55 65 42 39 28 38 46 60 76 65 39 51 62 49
## [18] 33 33 57 45 38 41 41 88 30 27 20 18 17 3 19 21 13
## [35] 6 2 7 4 1 1 0 0 0 0 0 0 0 0 0 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Hm… citric acid has an odd shape. The largest concentration of observations is very close to zero, followed by another around 0.5. Also, there seems to be an outlier observation (at 1.0).
Log-transforming it doesn’t seem to add much information either.
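The 50 counts printed above are consistent with bins of width 0.02 over the range 0 to 1. One way to obtain such counts without drawing anything is sketched below; the values are simulated, not the real citric.acid column, and the bin width is an assumption inferred from the output.

```r
set.seed(2)
# Simulated values in [0, 1], rounded to two decimals; NOT the real citric.acid column.
x <- round(rbeta(300, shape1 = 1, shape2 = 3), 2)

# Count observations in bins of width 0.02 without drawing the plot.
bins <- hist(x, breaks = seq(0, 1, by = 0.02), plot = FALSE)
print(bins$counts)

# Locate the two fullest bins and report their edges.
top2 <- order(bins$counts, decreasing = TRUE)[1:2]
print(data.frame(from  = bins$breaks[top2],
                 to    = bins$breaks[top2 + 1],
                 count = bins$counts[top2]))
```

Looking at the bin edges next to the counts makes it easier to say exactly where the peaks of a bimodal-looking histogram are.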
quality is a factor variable which represents the quality of the wine, graded on a scale between 0 (very bad) and 10 (very excellent). The majority of the observations received a grade of 5 or 6.
volatile.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates
After reading the dataset documentation, plotting some variables and drawing on previous, albeit limited, personal knowledge about wine, I suspect that sulphates, volatile acidity and residual sugar are the main features that influence the quality of the wine.
Probably pH, citric acid and fixed acidity will also be relevant to study in more detail in the following sections.
At this point, given the basic exploration, I didn’t see the need to create any new variables.
citric.acid had an unusual distribution, as it seems to have two bins with a very high count (at 0.0 to 0.02 and 0.46 to 0.48), thus suggesting some sort of bimodality.
X was removed because it was simply an index of the observations.
The quality variable was converted from numeric to a factor with the correct order. This was necessary as it is inherently a categorical and discrete variable.
Let’s plot all variables against each other and print a numeric correlation matrix to have a clearer picture of their relationships.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## sulphates alcohol
## fixed.acidity 0.183005664 -0.06166827
## volatile.acidity -0.260986685 -0.20228803
## citric.acid 0.312770044 0.10990325
## residual.sugar 0.005527121 0.04207544
## chlorides 0.371260481 -0.22114054
## free.sulfur.dioxide 0.051657572 -0.06940835
## total.sulfur.dioxide 0.042946836 -0.20565394
## density 0.148506412 -0.49617977
## pH -0.196647602 0.20563251
## sulphates 1.000000000 0.09359475
## alcohol 0.093594750 1.00000000
Interesting… looking at the correlation matrix we see that density and sulphates seem to be more strongly correlated with other variables, particularly fixed acidity, citric acid and chlorides. In particular, density is highly correlated with fixed acidity (0.67). Furthermore, the pairs plot shows that wines with higher quality tend to have a higher median alcohol and citric acid. In comparison, higher quality wines have lower median values for volatile acidity and pH.
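A correlation matrix like the one above can be produced with base R's `cor()` once the factor column is set aside. The sketch below uses simulated columns whose names mirror the dataset; the values and the data frame name `toy` are made up for illustration.

```r
set.seed(3)
n <- 200
# Simulated columns; names mirror the dataset but the values are made up.
fixed_acidity <- rnorm(n, mean = 8.3, sd = 1.7)
toy <- data.frame(
  fixed.acidity = fixed_acidity,
  pH            = 3.3 - 0.06 * (fixed_acidity - 8.3) + rnorm(n, sd = 0.1),
  alcohol       = rnorm(n, mean = 10.4, sd = 1),
  quality       = factor(sample(3:8, n, replace = TRUE))
)

# Keep numeric columns only; cor() does not accept factors.
num_cols <- sapply(toy, is.numeric)
corr <- cor(toy[, num_cols], method = "pearson")
print(round(corr, 2))
```

Excluding quality here is the reason the matrix has 11 rows rather than 12: as a factor, it has to be examined through plots or a rank-based measure instead.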
I want to see the aforementioned relationships with respect to quality in individual plots to have a better sense of their magnitude.
## Group.1 x
## 1 3 0.845
## 2 4 0.670
## 3 5 0.580
## 4 6 0.490
## 5 7 0.370
## 6 8 0.370
The median value of volatile acidity drops to 0.37 in the higher quality wines (grades 7 and 8).
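The `Group.1` and `x` column names in these tables are what base R's `aggregate()` produces when it is given a vector and a `by` list, so the medians may well have been computed along the following lines. The data below is made up for illustration and `toy` is a hypothetical name.

```r
# Made-up values for illustration; NOT the real data.
toy <- data.frame(
  quality          = factor(c(3, 3, 5, 5, 5, 7, 7, 8)),
  volatile.acidity = c(0.90, 0.80, 0.60, 0.58, 0.50, 0.40, 0.34, 0.37)
)

# Median of volatile.acidity within each quality level.
med <- aggregate(toy$volatile.acidity, by = list(toy$quality), FUN = median)
print(med)
```

The same call with a different column gives the per-grade medians for sulphates, pH, alcohol or citric acid.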
## Group.1 x
## 1 3 0.545
## 2 4 0.560
## 3 5 0.580
## 4 6 0.640
## 5 7 0.740
## 6 8 0.740
The median value of sulphates increases to 0.74 in the higher quality wines (grades 7 and 8).
## Group.1 x
## 1 3 3.39
## 2 4 3.37
## 3 5 3.30
## 4 6 3.32
## 5 7 3.28
## 6 8 3.23
In this case, the median value of pH decreases with quality, but the change is less marked than for the previously examined variables.
## Group.1 x
## 1 3 9.925
## 2 4 10.000
## 3 5 9.700
## 4 6 10.500
## 5 7 11.500
## 6 8 12.150
Similar to sulphates, the median value of alcohol increases with the quality of wines overall, although it dips to 9.7 at quality 5. It must also be noted that quality 5 has a number of observations with high alcohol content.
## Group.1 x
## 1 3 0.035
## 2 4 0.090
## 3 5 0.230
## 4 6 0.260
## 5 7 0.400
## 6 8 0.420
The amount of citric acid and the quality of wine seem to have a noticeable positive relationship, with the median going from 0.035 in grade-3 wine all the way up to 0.42 in grade-8 wine.
Now I want to see in more detail the relationship between some features (i.e. not with respect to quality, the dependent variable) which I previously noticed had a strong correlation. More specifically, I want to explore in scatter plots the relationship of fixed acidity with density and pH, as well as that of citric acid with pH.
## [1] 0.6680473
## [1] -0.6829782
## [1] -0.5419041
Some interesting relationships were discovered using bivariate analyses. First of all, I previously thought alcohol was not related to the quality of wine. However, plotting quality vs alcohol strongly suggests that the two are positively related: wines with higher quality tend to have a higher median alcohol. That said, it is also worth noting that quite a number of observations had a high level of alcohol but not a high quality ranking. Furthermore, residual sugar, which I previously thought was very relevant, doesn’t seem to have a strong relationship with the quality of wine.
What I also noticed was that higher quality wines have lower median values for volatile acidity and pH.
Given that the dependent variable (quality of wine) is categorical, I couldn’t get a correlation number. However, the boxplots suggest that both volatile acidity and pH have a negative relationship with quality, strong for volatile acidity and milder for pH. That is, the quality level increases as the level of pH and volatile acidity decreases. In comparison, alcohol and citric acid show a strong positive relationship with the quality of wine.
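Because the quality grades are ordered, one option for getting a single number is a rank-based measure such as Spearman's rho, computed on the numeric grade recovered from the factor. This is a suggestion rather than something done in the analysis above, and the sketch uses simulated data with made-up names.

```r
set.seed(4)
n <- 300
# Simulated data: quality loosely increases with alcohol. NOT the real dataset.
alcohol <- rnorm(n, mean = 10.4, sd = 1)
score   <- round(5.6 + 0.5 * (alcohol - 10.4) + rnorm(n, sd = 0.7))
quality <- factor(pmin(pmax(score, 3), 8), levels = 3:8)

# Recover the numeric grade from the factor labels (not the level codes).
quality_num <- as.integer(as.character(quality))

# Spearman's rho only uses ranks, so it suits an ordinal grade.
rho <- cor(alcohol, quality_num, method = "spearman")
print(rho)
```

Going through `as.character()` first avoids the common pitfall where `as.integer()` on a factor returns the internal level codes (1, 2, 3, ...) instead of the grades themselves.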
Let’s now plot the variables we just saw but in a multivariate way. First I want to examine density plots (colored by quality level) of variables examined in the previous boxplots. Then, I want to re-examine the relationship between fixed acidity, pH, sulphates and citric acid but this time adding quality level as color.
The density plots support the relationship between quality and the examined variables as shown in the boxplots from before. For example we see that alcohol distribution for high quality wines is shifted right. Also, at quality level 5 we notice a hump towards value 13% which also showed up in the boxplots.
The scatter plots colored by quality level don’t seem to reveal much info. However, the volatile acidity vs citric acid plot shows an interesting pattern: most grade-7 wines have a volatile acidity below 0.4 and a citric acid value in the range 0.25 - 0.75.
By now it seems clearer that citric acid is quite relevant to the quality of wine. Now I’m going to plot it against a few variables, but this time I will facet the plots by quality level to see if there is another interesting finding.
I don’t see anything unusual, nor any additional information beyond what the other plots have shown.
The multivariate analyses of the main features identified seem to point in the same direction as the earlier findings. That is, the distribution of good wines (7 and 8 rating) with respect to the selected features (see density plots) seems to stand out from that of lower grade wines. For instance, the distribution of citric acid for good quality wine is more left skewed than the lower grade wine distributions. A similar behaviour is seen with the sulphates density distribution when grouped by quality.
The objective of this exploratory analysis is to understand which features affect the quality of wine. In the course we learnt how to fit a linear regression on a numerical dependent variable. However, this time the dependent variable is categorical in nature, and unfortunately my knowledge of how to conduct this sort of model is extremely limited, hence I didn’t create a model.
Citric acid median is considerably higher for good quality wines (7 and 8), going from 0.035 g/dm^3 for low quality wines all the way up to 0.42 g/dm^3 which equates to 12x the concentration.
Alcohol median is notably higher for good quality wines, that is, grades 7 and 8, with grade 8 having a median above 12%. However, there are some seemingly outlier wines that have high alcohol content but are graded as average (5 - 6). The most noticeable jump in median alcohol concentration is between levels 6 and 7, which represents a 1 percentage point increase in the median.
Volatile acidity (i.e. acetic acid) median is lower for good quality wines (grade 7 and 8), both having a median of 0.37 g/dm^3 (less than half compared to the lowest quality wines). However, there are some outlier wines that have high acetic acid content and yet are graded as average (5 - 6).
The red wine dataset consists of 1599 observations of red variants of the Portuguese “Vinho Verde” wine.
The exploratory process began by understanding individual variables via summary statistics and histograms. This gave me a sense of the distribution of each individual variable, but what really got me going and asking questions was plotting the pairs plot and correlation matrix, as I was able to quickly see which variables seemed to have a closer relationship with the quality of wine, such as alcohol, sulphates, volatile acidity and especially citric acid. The following steps consisted of plotting in more detail the features I found interesting and trying to see if there was a pattern.
From the analyses conducted, and with my limited domain knowledge, I have noticed that three variables have a very close relationship with the quality of wines. Those variables are: citric acid, acetic acid and alcohol. These findings surface an interesting observation which is basically that not all acids are made equal, as some seem to be linked to higher quality wines (i.e. citric acid) and others seem to have a negative impact (e.g. acetic acid).
During my exploration I noticed that the number of observations varied widely from one quality level to another (5 and 6 had significantly more observations). I believe this could be a limitation of the data as, ideally, one would want enough data points at each quality level to make the statistical methods more robust. Moreover, I find it quite strange that the dataset didn’t have any observations with quality levels below 3 or above 8. I have a feeling that this could also affect the robustness of statistical methods applied to this dataset.
My main struggle was not being able to fit a linear regression with quality as the dependent variable. This was due to the fact that at this point I don’t know how to apply a similar method when the dependent variable is categorical. Thinking about how to expand my analysis, I believe that understanding how probit or logit regressions work would allow me to fit a model and obtain some numbers on the relationships I identified via plots.
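As a pointer for that next step, an ordered logit (proportional-odds) model is one common choice when the outcome is an ordered grade. The sketch below fits one with `polr()` from the MASS package on simulated data; it is not a model of the wine dataset, and the variable names, cut-points and effect size are all made up.

```r
library(MASS)  # ships with standard R installations

set.seed(5)
n <- 500
# Simulated data with a known positive effect; NOT the real dataset.
alcohol_z <- rnorm(n)                       # standardised alcohol (made up)
latent    <- 1.2 * alcohol_z + rlogis(n)    # unobserved "true" quality
toy <- data.frame(
  alcohol_z = alcohol_z,
  quality   = cut(latent, breaks = c(-Inf, -1, 1, Inf),
                  labels = c("low", "mid", "high"), ordered_result = TRUE)
)

# Proportional-odds (ordered logit) model; Hess = TRUE keeps the Hessian for summary().
fit <- polr(quality ~ alcohol_z, data = toy, method = "logistic", Hess = TRUE)
print(coef(fit))   # slope on the log-odds scale
print(fit$zeta)    # estimated cut-points between adjacent grades
```

A positive slope means that higher values of the predictor shift the odds towards higher grades; swapping `method = "logistic"` for `method = "probit"` gives the ordered probit variant.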